(57) Abstract: While SNP-based marker sets and population-level DNA repositories are approaching sufficient size for
whole-genome association studies, individual genotyping remains very costly. Pooled DNA tests are a less costly
alternative, but uncertainty about loss of power due to allele frequency measurement error and population stratification
hinders their use. Here we describe how to optimize pooled tests as an explicit function of measurement error, and we
present family-based tests that eliminate stratification effects. We show that identification of functional genetic
variants and linked markers may be feasible with current-day instruments.

FAMILY-BASED ASSOCIATION TESTS FOR QUANTITATIVE TRAITS USING POOLED DNA



Introduction 

Association tests of outbred populations are thought to have greater power than
traditional family-based linkage analysis to identify the genetic variants contributing to
complex human diseases (Risch and Merikangas, 1996; Ott 1999; Ardlie 2002). A genome
scan based on allelic association would require approximately 100,000 markers, estimated by
dividing the 3.3 gigabase human genome by the several kilobase extent of population-level
linkage disequilibrium (Abecasis et al. 2001; Reich et al. 2001). Single-nucleotide
polymorphisms (SNPs) occur at sufficient density to provide a suitable marker set (Collins et
al. 1997). Furthermore, SNPs in coding and regulatory regions have additional value as
potential functional variants.

Individual genotyping remains prohibitively expensive for a genome scan. One
method to reduce cost is to pool DNA from individuals with extreme phenotypic values and
to measure the allele frequency difference between pools (Barcellos et al., 1997; Daniels et
al., 1998; Fisher et al., 1999; Hill et al., 1999; Shaw et al., 1998; Stockton et al., 1998; Suzuki
et al., 1998). Initial attention focused on pooled designs for dichotomous traits and
case-control studies (Risch and Teng 1998). More recently, pooled tests have been discussed for
quantitative traits, a more appropriate model for diseases such as obesity and hypertension.
In the absence of experimental error, the optimal design for an unrelated population is to
compare frequencies between pools of the most extreme 27% of individuals ranked by
phenotypic value, retaining 80% of the information of individual genotyping (Bader et al.,
2001). Experimental sources of error, primarily allele frequency measurement error, degrade
the test power (Jawaid et al., 2002).

Population stratification poses a second challenge to practical use of pooled tests for 
human populations. Genomic control methods, developed to reduce stratification effects in 
genotype-based association tests (Devlin and Roeder 1999; Pritchard and Rosenberg 1999; 
Pritchard et al. 2001; Zhang and Zhou, 2001), are not directly applicable to pooled tests.

Here we present optimized pooled DNA test designs, including family-based tests
robust to stratification. Estimates of test power explicitly include allele frequency
measurement error. This distinguishes our treatment from prior theoretical work, permits the
optimization of test design as a function of known parameters, and provides a bridge to
experimentalists seeking practical guidance for whether to attempt and how to perform
pooled association tests.

Summary of the Invention 

The invention is drawn to a method for detecting an association in a population of
unrelated individuals between a genetic locus and a quantitative phenotype, wherein two or
more alleles occur at the locus, and wherein the phenotype is expressed using a numerical
phenotypic value whose range falls within a first numerical limit and a second numerical
limit. This method comprises the steps of:

a) obtaining the phenotypic value for each individual in the population;

b) determining the minimum number of individuals from the population required for
detecting the association using a non-centrality parameter;

c) selecting a first subpopulation of individuals having phenotypic values that are
higher than a predetermined lower limit and pooling DNA from the individuals in the first
subpopulation to provide an upper pool;

d) selecting a second subpopulation of individuals having phenotypic values that are
lower than a predetermined upper limit and pooling DNA from the individuals in the second
subpopulation to provide a lower pool;

e) for one or more genetic loci, measuring the frequency of occurrence of each allele
at said locus in the upper pool and the lower pool;

f) for a particular genetic locus, measuring the difference in frequency of occurrence
of a specified allele between the upper pool and the lower pool; and

g) determining that an association exists if the allele frequency difference between the
pools is larger than a predetermined value.

In one embodiment of the invention, the difference in frequency of occurrence of the
specified allele has associated with it an error of measurement. In one aspect of the invention
the error of measurement is 0.04. In another, the error of measurement is 0.01.

In another embodiment of the invention, the predetermined lower limit is set so that
the upper pool ranges from including the highest 37% of the population to including the
highest 19% of the population and the predetermined upper limit is set so that the lower pool
ranges from including the lowest 37% of the population to including the lowest 19% of the
population. In another aspect of the invention, the predetermined lower limit is set so that the
upper pool includes the highest 27% of the population and the predetermined upper limit is
set so that the lower pool includes the lowest 27% of the population.

In another embodiment of the invention, the genetic locus has two alleles.

In another embodiment of the invention, the population includes individuals who may
be classified into classes. In one aspect of the invention, the classes are based on an age
group, gender, race or ethnic origin. In another aspect of the invention, the members of a
class are included in the pools.

In another embodiment of the invention the method is used for determining the
genetic basis of disease predisposition.

In another embodiment of the invention, the genetic locus which is analyzed for
determining the genetic basis of disease predisposition contains a single nucleotide
polymorphism.

Brief Description of the Drawings 

Figure 1. The information retained by the between-family pooled test design,
expressed as a fraction of the information from individual genotyping followed by a
between-family test, is depicted for sibships of size 4, 2, and 1, each population having 1000 total
individuals. The optimal pooling fraction, indicated by an arrow, shifts to lower values as the
number of sibs per family decreases. The optimal fraction and corresponding information
retained also shift to lower values as the minor allele frequency decreases, with results shown
for frequencies 0.1 and 0.01. The raw measurement error is 0.01.

Figure 2. The optimal number of sibs to select from each family (top panel) and
the information retained relative to individual genotyping (bottom panel) are shown for
sibship sizes 2-5, 6, 8, 16, and 32 as a function of the scaled measurement error κ. For
sibships through 5, it is always optimal to select just the highest and lowest sib.

Figure 3. The optimal fraction of families to select (top panel) and information
retained (lower panel) are displayed for sibships of size 2 through 6 as a function of the
scaled measurement error κ.

Figure 4. The optimal pooling fraction (top panel) and information retained
(bottom panel) for between-family and within-family tests of a population of 500 sib-pairs
are shown as a function of raw measurement error for marker frequencies 0.5 and 0.01. The
within-family tests include pre-selection of discordant-like families.

Figure 5. The optimal pooling fraction (top panel) and the information retained
(bottom panel) from exact numerical calculations (solid line) and an analytical fit (dashed
line) are displayed as a function of the normalized measurement error κ. The fit coincides
with the exact results for the information retained.

Detailed Description

We present optimized designs for pooled DNA tests conducted on a population of N/s
families, each a sibship of size s (N total individuals). The genotypic correlation within a
sibship is denoted r, with typical values of 1/4, 1/2, and 1 for half-sibs, full-sibs, and
monozygotic twins. Sibships may also represent inbred lines; in this case, r is the genetic
correlation within each line. Sibs in different families are assumed to have uncorrelated
genotypes.

To conduct a pooled DNA test for association of a particular allele $A_1$ with a
quantitative trait, individuals are selected for an upper pool, comprising higher phenotypic
values, and a lower pool, comprising lower phenotypic values, using designs reminiscent of
selection strategies for optimizing breeding value and for QTL mapping (Hill 1971; Kimura
and Crow 1978; Ollivier et al. 1997). We restrict attention to balanced designs in which each
pool has $fN$ individuals, with $f < 0.5$ defined as the pooling fraction. Balanced designs are
favored when high and low phenotypes are treated symmetrically; asymmetry can favor
unbalanced designs (Jawaid et al., 2002).

We consider four designs: (i) unrelated individuals (s = 1), in which the $fN$ individuals
having highest and lowest phenotypic values are selected for the upper and lower pools
respectively; (ii) between-family, in which all s sibs from the $fN/s$ families having highest and
lowest mean phenotypic values are selected for the upper and lower pools; (iii) within-family,
in which the s' sibs having highest and lowest phenotypic values within each family are
selected for the upper and lower pools, yielding a pooling fraction $f = s'/s$; (iv) within-family
with pre-selection of discordant families, in which a fraction $f'$ of families with greatest
within-family phenotypic variance are selected, $\mathrm{Var} = \sum_s (X_s - \bar{X})^2$, where $X_s$ is the
phenotype of sib s and $\bar{X}$ is the family mean, then the extreme high and low sib within each
selected family are selected for the upper and lower pool for a final pooling fraction $f = f'/s$.
A suitable statistic for a two-sided test for each design is

$$Z^2 = \frac{(p_U - p_L)^2}{\mathrm{Var}(p_U - p_L)},$$

where the estimated frequencies of allele $A_1$ in the upper and lower pools are
denoted $p_U$ and $p_L$. The variance is the sum of three terms, $\mathrm{Var}(p_U - p_L) = V_S + V_C + V_M$.
The sampling variance $V_S$ represents the unavoidable error in estimating the population
frequency from a finite sample. The concentration variance $V_C$ arises from sample-to-sample
DNA concentration variance within a pool. The measurement variance is $V_M = 2\varepsilon^2$, where $\varepsilon$
is the experimental allele frequency measurement error for each pool. We assume that the
three sources of variation are independent, which should be justified when individual and
pooled DNA samples are treated uniformly. In an ideal experiment, $V_C$ and $V_M$ vanish, and
the total variance is $V_S$.

Under the null hypothesis, $Z^2$ has a $\chi^2$ distribution with one degree of freedom.
Under the alternate hypothesis, the tested marker is assumed to be a bi-allelic quantitative
trait locus (QTL) with alleles $A_1$ and $A_2$ occurring at frequencies $p$ and $(1 - p) = q$. For
between-family tests, the alleles are assumed to be in Hardy-Weinberg equilibrium and the
population is assumed to be random mating; these assumptions may be relaxed for
within-family tests. The variance of the allele frequency per individual is $\sigma_p^2 = pq/2$. For each
design, the allele frequency is estimated as $\hat{p} = (p_U + p_L)/2$. The estimated variance of the
allele frequency per individual is denoted $\hat{\sigma}_p^2$ and equals $\hat{p}(1 - \hat{p})/2$.
The mean phenotypic effects are $m_G = a$, $d$, and $-a$ for genotypes $G = A_1A_1$, $A_1A_2$,
and $A_2A_2$, respectively. The dominance ratio $d/a$ describes the inheritance mode, with
typical values -1, 0, and 1 for pure recessive, additive, or dominant inheritance. The
proportion of trait variance accounted for by the QTL is denoted $\sigma_Q^2$,

$$\sigma_Q^2 = 2pq[a - d(p - q)]^2 + (2pqd)^2 = \sigma_A^2 + \sigma_D^2.$$

The mean QTL effect is $m = (p - q)a + 2pqd$. Phenotypic values are assumed to be
normally distributed for each genotype with mean $\mu_G = m_G - m$ and residual variance
$\sigma_R^2 = 1 - \sigma_Q^2$ arising from all genetic and environmental factors other than the QTL. The
distribution of phenotypic values in the population is a mixture of three normal distributions
with overall mean 0 and variance 1. The total phenotypic correlation between sibs from
genetic factors (including the QTL) and environmental factors is termed $t$.

The non-centrality parameter (NCP),

$$\mathrm{NCP} = [E(p_U - p_L)]^2 / \mathrm{Var}(p_U - p_L),$$

measures the information provided by a pooled DNA test. The notation $E(O)$ is the
expectation of an observable $O$. The approach followed below is to evaluate the numerator
of the NCP as a function of the model parameters, providing accurate analytical results when
possible and simulation results otherwise. For the denominator of the NCP, analytical results
are obtained for the null hypothesis. For the alternative hypothesis, the expected allele
frequencies for each pool have offsetting changes from $p$ to $p \pm \delta p$ (see Methods for
derivation), and the value of the denominator decreases by a small value.
We make a conservative approximation by ignoring the change and using the null
hypothesis denominator for the alternative hypothesis as well. In this case, the NCP equals
$(z_{\alpha/2} - z_{1-\beta})^2$, where $\alpha$ and $\beta$ are the type I and II error rates for the two-sided test
and $z_x$ is the standard normal deviate with upper-tail probability $x$.
Maximizing the NCP optimizes the test.
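The relation between error rates and the required NCP can be illustrated with a short calculation. This is a sketch under the stated convention that $z_x$ is the upper-tail standard normal deviate; the function name is hypothetical. With a type I error rate of 0.001 (for instance 100 false positives among 100,000 markers) and 80% power, the result is roughly 17.

```python
from statistics import NormalDist


def required_ncp(alpha, beta):
    """NCP = (z_{alpha/2} - z_{1-beta})^2 for a two-sided test.

    z_x is taken as the standard normal deviate with upper-tail
    probability x, so z_{1-beta} is negative when beta < 0.5.
    """
    std_normal = NormalDist()
    z_alpha_half = std_normal.inv_cdf(1.0 - alpha / 2.0)   # upper tail alpha/2
    z_one_minus_beta = std_normal.inv_cdf(beta)            # upper tail 1 - beta
    return (z_alpha_half - z_one_minus_beta) ** 2


if __name__ == "__main__":
    print(round(required_ncp(alpha=0.001, beta=0.2), 2))
```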

The denominator of the NCP is shown in the Methods to have the form

$$V_S + V_C + V_M = \frac{2G\sigma_p^2}{Nf} + \frac{2\tau^2\sigma_p^2}{Nf} + 2\varepsilon^2
= \frac{2\sigma_p^2}{Nf}\,(G + \tau^2)\left[1 + \frac{Nf\varepsilon^2}{(G + \tau^2)\sigma_p^2}\right]
= \frac{2\sigma_p^2}{Nf}\,(G + \tau^2)(1 + f\kappa^2),$$

where $\tau$ is the coefficient of variation for DNA concentration. The constant $G$ depends
only on the family structure and equals 1 for pools of unrelated individuals, $sR$ for the
between-family design, and $1 - r$ for both within-family designs; the standard notation $R$
relates the sib genotypic correlation $r$ to family-based variance components,

$$R = (1/s)[1 + (s - 1)r].$$

Typically $\tau$ is less than 10%; $\tau^2$ may usually be ignored relative to $G$. The term $\kappa^2$ is
used as shorthand,

$$\kappa^2 = \varepsilon^2 / [(G + \tau^2)\sigma_p^2 / N].$$

Referred to as the scaled measurement error, $\kappa$ represents the raw measurement error,
$\varepsilon$, scaled by the remaining sources of error in the allele frequency difference. In practice, $\kappa$
can be calculated prior to pooling because it depends on known quantities.
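Because the scaled error depends only on known quantities, it can be computed before any pooling is done. The sketch below assumes the definition $\kappa^2 = N\varepsilon^2 / [(G + \tau^2)\sigma_p^2]$ with $\sigma_p^2 = p(1-p)/2$; the function and parameter names are illustrative and do not come from the original text.

```python
import math


def scaled_error(eps, n_total, p, g=1.0, tau=0.0):
    """Scaled measurement error kappa for raw error eps.

    n_total: total number of individuals N
    p:       marker allele frequency
    g:       family-structure constant G (1 for unrelated individuals)
    tau:     coefficient of variation of DNA concentration
    """
    sigma_p_sq = p * (1.0 - p) / 2.0
    return eps * math.sqrt(n_total / ((g + tau ** 2) * sigma_p_sq))


def raw_error(kappa, n_total, p, g=1.0, tau=0.0):
    """Inverse relation: raw error eps corresponding to a scaled error kappa."""
    sigma_p_sq = p * (1.0 - p) / 2.0
    return kappa * math.sqrt((g + tau ** 2) * sigma_p_sq / n_total)


if __name__ == "__main__":
    # 1000 unrelated individuals and a 10% marker frequency
    print(raw_error(kappa=1.0, n_total=1000, p=0.1))
```

For 1000 unrelated individuals and a 10% marker frequency this gives a raw error of about 0.0067 per unit of $\kappa$.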



The numerator of the NCP is shown in the Methods to have the form

$$[E(p_U - p_L)]^2 = F \cdot \left(\frac{2\sigma_p\sigma_A\,\phi[\Phi^{-1}(1 - f)]}{\sigma_R f}\right)^2,$$

where $\phi(z)$ is the normal density $(2\pi)^{-1/2}\exp(-z^2/2)$, $\Phi(z)$ is the cumulative normal
probability, and $\Phi^{-1}$ is its functional inverse. The constant $F$ equals 1 for pools of unrelated
individuals, $R^2/T$ for between-family pools, and $(1 - R)^2/(1 - T)$ for within-family pools without
pre-selection. For the within-family design using discordant-like pre-selection,
$F = (1 - r)^2/2(1 - t)$ for sib-pairs (expressions for larger sibships are unwieldy). The term $R$ has the
same definition as before, and $T$ is the standard factor relating the sib phenotypic correlation $t$
to family-based variance components,

$$T = (1/s)[1 + (s - 1)t].$$

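Combining terms, the analytical result for the NCP, valid for small QTL effect, is

$$\mathrm{NCP} = \frac{N\sigma_A^2}{\sigma_R^2} \cdot \frac{F}{G + \tau^2} \cdot \frac{2\,\phi^2[\Phi^{-1}(1 - f)]}{f + f^2\kappa^2}.$$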

The first factor is identical to the NCP for an association test performed by individual
genotyping on a population of $N$ unrelated individuals; the second factor, with $\tau = 0$, is the
correction for individual genotyping a population of $N/s$ families each having $s$ sibs and then
performing either a between-family test, with $F/G = R/sT$, or a within-family test, with
$F/G = (s - 1)(1 - R)/s(1 - T)$. The third factor represents the fraction of information retained when
the association test is performed by pooling instead of individual genotyping, and
maximizing this factor with respect to the pooling fraction $f$ provides the optimal pool size.
When the measurement error $\varepsilon = 0$, tests are optimized with $f = 0.27$ and 80% of the
information is retained (Bader et al. 2001). As $\varepsilon$ increases, the maximum information that
can be retained is determined entirely by the single collective term $\kappa$.

Expressions for $F$, $G$, and $\kappa^2$ are summarized in Table I, and we now provide
examples of each family-based design. Information retained by the between-family design is
depicted in Fig. 1, with results for 3 sibship sizes: sib-quads, sib-pairs, and unrelated
individuals, each population having 1000 total individuals. The optimal pooling fraction,
indicated by an arrow, shifts to lower values as the number of sibs per family decreases. The
optimal fraction and corresponding information retained also shift to lower values as the
minor allele frequency decreases, with results shown for frequencies 0.1 and 0.01. The raw
measurement error $\varepsilon = 0.01$ in this example, and the pooling fraction and information retained
would decrease for larger $\varepsilon$ (see Fig. 4 for examples of changing $\varepsilon$).

In Fig. 2, the optimal number of sibs to select from each family (top panel) and the
information retained relative to individual genotyping (bottom panel) are shown as a function
of the scaled measurement error $\kappa$ for sibship sizes of 2-5, 6, 8, 16, and 32. For sibships
through 5, it is always optimal to select just the highest and lowest sib. For larger families
and small measurement error, the top and bottom quarters of the sibs are pooled and 80% of
the information is retained. The pooling fraction and information decrease as the
measurement error increases.

Within-family tests can be improved by pre-selection of discordant-like families, as
shown in Fig. 3. The optimal fraction of families to select (top panel) and information
retained (bottom panel) are displayed for sibships of size 2 through 6 as a function of the
scaled measurement error $\kappa$ (results determined by computer simulation). The fraction of
families and information retained both decrease as $\kappa$ increases. Discordant pre-selection has
the greatest benefit for sib-pairs: for the smallest values of $\kappa$, only 56% of families are
selected, retaining 80% of the information; had all families been used, only 60% of the
information would have been retained. Pre-selection is less important for trios and larger
sibships.

In Fig. 4, the optimal pooling fraction (top panel) and information retained (bottom
panel) using between-family pools and using within-family pools with discordant-like
pre-selection are displayed for a population of 500 sib-pairs (1000 individuals) as a function of
the raw measurement error $\varepsilon$. Results are shown for marker frequencies 0.5 and 0.01. With no
measurement error, the optimal pooling fraction of 0.27 retains 80% of the information in
each case. As measurement error increases, the optimal pooling fraction decreases, as does
the information retained.

The information loss increases for rarer alleles and is worse for the within-family test
than for the between-family test. This behavior can be deduced from the scaled error $\kappa^2$,
which is inversely proportional to the allele frequency sampling variance. Since the sampling
variance is 3x smaller within-family vs. between-family, $\kappa^2$ is 3x larger, $4N\varepsilon^2/p(1 - p)$ vs.
$4N\varepsilon^2/3p(1 - p)$, and more information is lost. The inverse dependence of $\kappa^2$ on the allele
frequency explains the decrease in power for rare alleles.

Because the allele frequency difference between sibs is uncorrelated with their allele
frequency mean, the between-family and within-family tests are independent estimators of $\sigma_A$
even when individuals contribute their DNA under both designs. The NCP of a combined
test is the sum of the NCPs for each test and it too follows a $\chi^2$ distribution with 1 degree of
freedom. In practice, estimates for $\sigma_A$ may be obtained by inverting the expressions for
$E(p_U - p_L)$ provided in Table I, then weighting each estimator by the inverse of its variance.
Population stratification may be indicated by a difference between the estimates for
$\sigma_A$ from a between-family and within-family test. In the absence of stratification, the
difference follows a normal distribution with variance

$$\mathrm{Var}(\hat{\sigma}_{A+} - \hat{\sigma}_{A-}) = V_+ \cdot \left[\frac{f_+^2\,T\,\sigma_R^2}{4 y_+^2 R^2 \sigma_p^2}\right] + V_- \cdot \left[\frac{f_-^2 (1 - T)\,\sigma_R^2}{4 y_-^2 (1 - R)^2 \sigma_p^2}\right],$$

where the "+" and "-" subscripts refer to the between-family and within-family
designs respectively, $y_\pm = \phi[\Phi^{-1}(1 - f_\pm)]$, and $V$ represents the total variance, $V_S + V_C + V_M$,
for each design. When stratification is indicated, the between-family estimate of $\sigma_A$ may be
unreliable but the within-family estimate remains robust.
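The weighting and comparison of the two estimators can be illustrated as follows. This sketch assumes the two estimates of $\sigma_A$ are independent, as stated above, and that their variances are already known; the function names and the numbers in the usage example are hypothetical.

```python
import math


def combine_estimates(est_between, var_between, est_within, var_within):
    """Inverse-variance weighted combination of two independent estimates."""
    w_between = 1.0 / var_between
    w_within = 1.0 / var_within
    combined = (w_between * est_between + w_within * est_within) / (w_between + w_within)
    combined_var = 1.0 / (w_between + w_within)
    return combined, combined_var


def stratification_z(est_between, var_between, est_within, var_within):
    """Standardized difference between the two estimates.

    For independent estimators the variance of the difference is the sum
    of the two variances; a large |z| would suggest stratification.
    """
    return (est_between - est_within) / math.sqrt(var_between + var_within)


if __name__ == "__main__":
    # Illustrative numbers only
    print(combine_estimates(0.10, 0.0004, 0.14, 0.0012))
    print(stratification_z(0.10, 0.0004, 0.14, 0.0012))
```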

In Fig. 5, the optimal pooling fraction (top panel) and the information retained
(bottom panel) are displayed as a function of the scaled measurement error $\kappa$. The
information retained is calculated assuming no concentration variance. In addition to the
numerically calculated results, an accurate fit is shown using the functional form
$f = 1 - \Phi[A - (3/A)\ln A - 0.067]$, with

$$A(\kappa) = \sqrt{2 + \ln\left(1 + 3\kappa^2 + \tfrac{2}{\pi}\kappa^4\right)}.$$

A justification for this functional form is provided in the Methods. The greatest
deviations for the pooling fraction are at $\kappa = 0.5$, where the fit yields a pooling fraction that is
0.006 too high, and at $\kappa = 3.5$, where the fit is 0.01 too low. The information retained using
the analytical value for the pooling fraction coincides with the exact numerical results on the
scale of the figure. The experimental measurement error $\varepsilon$ corresponding to the scaled error
$\kappa$ depends on the population structure and marker frequency. For example, for a population
of 500 cases, 500 matched unrelated controls, and 10% marker frequency, $\varepsilon = 0.0067\kappa$ is the
raw error corresponding to $\kappa$.
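The analytical fit can be evaluated directly. The sketch below uses the constants -0.067, 2, and 3 quoted in the text; the coefficient $2/\pi$ on the $\kappa^4$ term is an assumption, chosen so that the fit reduces to the large-$\kappa$ asymptotic form derived in Example 3. Function names are illustrative.

```python
import math
from statistics import NormalDist


def fit_threshold(kappa, a1=-0.067, a2=2.0, a3=3.0):
    """Fitted selection threshold z = A - (3/A) ln A + a1.

    The kappa^4 coefficient 2/pi is assumed from the asymptotic analysis.
    """
    big_a = math.sqrt(a2 + math.log(1.0 + a3 * kappa ** 2 + (2.0 / math.pi) * kappa ** 4))
    return big_a - (3.0 / big_a) * math.log(big_a) + a1


def fit_pooling_fraction(kappa):
    """Fitted optimal pooling fraction f = 1 - Phi(z)."""
    return 1.0 - NormalDist().cdf(fit_threshold(kappa))


if __name__ == "__main__":
    for kappa in (0.0, 0.5, 1.0, 3.5, 10.0):
        print(kappa, round(fit_threshold(kappa), 4), round(fit_pooling_fraction(kappa), 4))
```

At $\kappa = 0$ the fitted threshold is close to 0.612, corresponding to a pooling fraction of about 0.27.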

Based on the pooled designs described above, we outline a prospective study using
100,000 markers to detect QTLs with a 1% effect. If 100 false-positives are permitted from
pooled tests (the false-positives may be resolved using individual genotyping) and 80%
power is required, the NCP is 17. We assume pooling of discordant sib-pairs to protect
against stratification effects. At the scaled error $\kappa = 1$ where the pooled tests are still close to
maximum power, the pooling fraction would be 21%, 65% of the information of a population
would be retained, and a population of 2600 individuals would be required. The raw
measurement error corresponding to $\kappa = 1$ for this population size is 0.005 for an allele with
50% frequency and 0.002 for an allele with 5% frequency, 5x to 10x more precise than
achieved by current-day instrumentation.

We can account for current-day precision by setting $\kappa = 10$, which from Fig. 5 is seen
to retain 7.7% of the information and corresponds to a pooling fraction of 1.6% of a total
population of 22,000. In this case, the precision required for a pooled test is 0.017 for an
allele with 50% frequency and 0.007 for an allele with 5% frequency. These precisions are
within the range of current performance, especially if repeated measures are used to decrease
the effective measurement error. The cost to collect and score such a population for multiple
disease-related phenotypes would be under $50 million. Selection schemes could then be
applied to generate pools for each phenotype in turn.
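The arithmetic of the prospective study can be retraced with a short calculation. This sketch rests on two assumptions that are not stated explicitly in the text: the population size is taken as NCP divided by the product of the QTL variance and the information retained (that is, the family-structure factor $F/G$ is set to 1), and $G = 1 - r = 0.5$ for full-sib pairs in the within-family design. All names are hypothetical.

```python
import math


def required_population(ncp, qtl_variance, info_retained, f_over_g=1.0):
    """Population size N needed to reach a target NCP (F/G = 1 assumed)."""
    return ncp / (qtl_variance * f_over_g * info_retained)


def raw_error_for_kappa(kappa, n_total, p, g):
    """Raw measurement error eps = kappa * sqrt(G * sigma_p^2 / N), tau = 0."""
    sigma_p_sq = p * (1.0 - p) / 2.0
    return kappa * math.sqrt(g * sigma_p_sq / n_total)


if __name__ == "__main__":
    # kappa = 1: 65% of the information retained
    print(required_population(17.0, 0.01, 0.65))
    print(raw_error_for_kappa(1.0, 2600, 0.50, g=0.5))
    print(raw_error_for_kappa(1.0, 2600, 0.05, g=0.5))
    # kappa = 10: 7.7% of the information retained
    print(required_population(17.0, 0.01, 0.077))
    print(raw_error_for_kappa(10.0, 22000, 0.50, g=0.5))
    print(raw_error_for_kappa(10.0, 22000, 0.05, g=0.5))
```

Under these assumptions the calculation gives values close to the population sizes and precisions quoted in the two preceding paragraphs.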

As noted previously, pooled tests perform worse for within-family tests and rare
alleles, and may therefore be difficult to apply to disease-risk variants under negative
selection pressure. The loss of power may be less severe for pharmacogenetic studies of
variants affecting drug response, where selection pressure is absent, and for test crosses of
model organisms (Grupe et al. 2001) or agricultural species whose marker frequencies are
under experimental control.

The analysis provided here for quantitative traits may be extended to threshold
characters yielding dichotomous classifications of a population. For case-control
classification, the disease prevalence corresponds to the pooling fraction $f$. When the
quantitative character is available for measurement, it is approximately 4x more efficient to
compare unrelated individuals with extremely high vs. extremely low characters than to
compare the derived cases vs. controls (Bader et al. 2001).

In summary, we have derived the optimal pooling fractions for within-family and
between-family tests of association. With ideal instrumentation, 80% of the information is
retained and the optimal pooling fraction is 27%. As allele frequency measurement error
increases, the optimal pooling fraction and the information retained both decrease. The
information loss is more severe for low-frequency alleles and for within-family tests. The
optimal pooling fraction depends on a single parameter representing the measurement error,
and optimized pooling designs are provided as a function of this parameter.

Examples

Example 1: Sampling variance and concentration variance

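Let $p_i$ represent the frequency of allele $A_1$ for individual $i$, either 0, 1/2, or 1, and $c_i$
represent the concentration of DNA contributed by this individual to a pool of $n$ individuals.
Neglecting measurement error, the allele frequency $p^*$ for the pool is

$$p^* = \frac{\sum_i c_i p_i}{\sum_i c_i} = \frac{1}{n}\sum_i \frac{(c_0 + \delta c_i)}{c_0}\,(p + \delta p_i) = p + \frac{1}{n}\sum_i (1 + \delta c_i')\,\delta p_i,$$

which defines the relative concentration error $\delta c_i'$. The terms $\delta p_i$ and $\delta c_i'$ are
uncorrelated, and each has expectation zero. Furthermore, the sum of the $\delta c_i'$ terms is
constrained to be zero. The variance of $p^*$ is

$$\mathrm{Var}(p^*) = \frac{1}{n^2}\sum_{ij}\mathrm{Cov}(\delta p_i, \delta p_j) + \frac{1}{n^2}\sum_{ij} E(\delta c_i'\,\delta c_j')\,\mathrm{Cov}(\delta p_i, \delta p_j)
= \frac{\sigma_p^2}{n^2}\sum_{ij} r_{ij} + \frac{\tau^2\sigma_p^2}{n}.$$

We have used

$$\mathrm{Cov}(\delta p_i, \delta p_j) = E(\delta p_i\,\delta p_j) = \sigma_p^2\,r_{ij} \quad\text{and}\quad \mathrm{Cov}(\delta c_i', \delta c_j') \approx \tau^2\delta_{ij},$$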


with the concentration coefficient of variation defined as $\tau \equiv [\mathrm{Var}(c_i)]^{1/2}/c_0$ and the
genotypic correlation between a pair of individuals defined as $r_{ij}$.

For the between-family design, a pool of $n$ individuals contains $n/s$ sibships of size $s$ and
genotypic correlation $r$. The result for $\mathrm{Var}(p^*)$ is

$$\mathrm{Var}(p^*) = \frac{sR\,\sigma_p^2}{n} + \frac{\tau^2\sigma_p^2}{n},$$

with $R = (1/s)[1 + (s - 1)r]$. Since the individuals in the upper and lower pools are
unrelated, $V_S + V_C = 2\,\mathrm{Var}(p^*)$.
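The closed-form between-family result can be checked against the general double sum over pairs of individuals. The sketch below assumes the general form $\mathrm{Var}(p^*) = (\sigma_p^2/n^2)\sum_{ij} r_{ij} + \tau^2\sigma_p^2/n$, with $r_{ij}$ equal to 1 on the diagonal, $r$ within a sibship, and 0 between sibships; names are illustrative and $n$ is assumed to be a multiple of $s$.

```python
def pool_variance_from_pairs(n, s, r, p, tau):
    """Var(p*) from an explicit sum over all pairs in a pool of n individuals."""
    sigma_p_sq = p * (1.0 - p) / 2.0
    total_r = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                total_r += 1.0        # an individual with itself
            elif i // s == j // s:
                total_r += r          # sibs in the same family
            # individuals in different families are uncorrelated
    return sigma_p_sq * total_r / n ** 2 + tau ** 2 * sigma_p_sq / n


def pool_variance_closed_form(n, s, r, p, tau):
    """Var(p*) = s R sigma_p^2 / n + tau^2 sigma_p^2 / n."""
    sigma_p_sq = p * (1.0 - p) / 2.0
    big_r = (1.0 + (s - 1) * r) / s
    return s * big_r * sigma_p_sq / n + tau ** 2 * sigma_p_sq / n


if __name__ == "__main__":
    args = dict(n=12, s=4, r=0.5, p=0.3, tau=0.1)
    print(pool_variance_from_pairs(**args), pool_variance_closed_form(**args))
```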

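For a within-family design, the allele frequency difference between pools is

$$\Delta p^* = \frac{1}{n}\sum_i (1 + \delta c_i')\,\delta p_i - \frac{1}{n}\sum_j (1 + \delta c_j')\,\delta p_j,$$

where $i$ and $j$ label individuals in the upper and lower pools respectively. The
variance is

$$\mathrm{Var}(\Delta p^*) = \frac{2}{n^2}\sum_{ii'}(1 + \tau^2\delta_{ii'})\,\mathrm{Cov}(\delta p_i, \delta p_{i'}) - \frac{2}{n^2}\sum_{ij}\mathrm{Cov}(\delta p_i, \delta p_j)
= \frac{2(1 - r)}{n}\,\sigma_p^2 + \frac{2\tau^2}{n}\,\sigma_p^2.$$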

Example 2: Expected allele frequency difference and non-centrality parameter 

The genotype-dependent phenotype distribution is defined using a variance
components model,

$$X_{ki} = Y_k + Y_{ki} + \mu_{ki}.$$

Family and individual effects are normally distributed with mean zero and variance

$$\mathrm{Var}(Y_k) = t - r\sigma_A^2 - u\sigma_D^2,$$
$$\mathrm{Var}(Y_{ki}) = \sigma_R^2 - t + r\sigma_A^2 + u\sigma_D^2.$$

The family index is $k$, the sib index is $i$, and the individual phenotypes are the sum
of $Y_k$, the family effect excluding the QTL, $Y_{ki}$, the individual effect excluding the QTL, and
$\mu_{ki}$, the QTL effect $\mu(G_{ki})$ for sib $i$. The total phenotypic correlation between sibs is $t$. Both
$r$ and $u$ relate to the genetic background shared between sibs, $r$ being the genotypic
correlation (1 for monozygotic twins, 1/2 for full sibs, 1/4 for half sibs) and $u$ being the
shared genotype expectation (1 for monozygotic twins, 1/4 for full sibs, 0 for half sibs)
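(Falconer and Mackay 1996).

The observed phenotypes $X_{ki}$ are re-expressed as family means and individual
deviations from family means,

$$X_{k\cdot} = \frac{1}{s}\sum_i X_{ki}, \qquad \delta X_{ki} = X_{ki} - X_{k\cdot}.$$

Similar quantities are defined for the QTL effects,

$$\mu_{k\cdot} = \frac{1}{s}\sum_i \mu_{ki}, \qquad \delta\mu_{ki} = \mu_{ki} - \mu_{k\cdot},$$

and the variances of the observed quantities excluding QTL effects are

$$\mathrm{Var}(X_{k\cdot} - \mu_{k\cdot}) = \frac{1}{s}\left[\sigma_R^2 + (s - 1)(t - r\sigma_A^2 - u\sigma_D^2)\right] \equiv T\sigma_R^2,$$
$$\mathrm{Var}(\delta X_{ki} - \delta\mu_{ki}) = (1 - T)\,\sigma_R^2.$$

When the QTL effects are small, $T \approx (1/s)[1 + (s - 1)t]$ is an accurate approximation.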
The probability that sibling 1 from family $k$ with genotypes $G = (G_1, G_2, \ldots, G_s)$ is
selected for the upper pool is $1 - \Phi[(X' - \mu_G)/\sigma]$, where $\Phi(z)$ is the cumulative normal
probability. The variable under selection, denoted $X$, is either $X_{k\cdot}$ (between-family pools) or
$\delta X_{k1}$ (within-family pools); $\mu_G$ is either $\mu_{k\cdot}$ (between-family pools) or $\delta\mu_{k1}$ (within-family
pools); the variance of $X - \mu_G$ is $\sigma^2$, either $T\sigma_R^2$ (between-family pools) or $(1 - T)\sigma_R^2$
(within-family pools); and $X'$ is the selection threshold applied to $X$. Because the labeling of
sibs is arbitrary, the fraction $f$ of individuals selected for pooling is equal to the probability
that sib 1 is selected, i.e. the probability that $X$ is greater than the selection threshold,

$$f = \sum_G \Pr(G)\,\{1 - \Phi[(X' - \mu_G)/\sigma]\},$$

where $\Pr(G)$ is the probability of observing the sibship genotypes $G$.
To calculate the allele frequency of the selected individuals, the threshold $X'$ is
required as a function of $f$. Numerical inversion may be applied to the above equation.
Alternatively, when the QTL effect is small ($\mu_G \ll \sigma$), the linear approximation

$$\Phi[(X' - \mu_G)/\sigma] \approx \Phi(X'/\sigma) - (\mu_G/\sigma)\,\phi(X'/\sigma)$$

is accurate, where $\phi(z) = d\Phi(z)/dz$ is the normal probability density. The terms
linear in $\mu_G$ cancel in the sum over $G$, yielding $f = 1 - \Phi(X'/\sigma)$.
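The expected allele frequency of the resulting pool is

$$E(p_U) = \frac{1}{f}\sum_G \Pr(G)\,p_G\,\{1 - \Phi[(X' - \mu_G)/\sigma]\},$$

where $p_G$ represents the allele frequency of sib 1. Using the linear expansion for
$\Phi[(X' - \mu_G)/\sigma]$ yields

$$E(p_U) = \sum_G \Pr(G)\,p_G + \frac{\phi(X'/\sigma)}{f\sigma}\sum_G \Pr(G)\,p_G\,\mu_G = p + \frac{\phi(X'/\sigma)}{f\sigma}\,E(p_G\mu_G).$$

An analogous expression for the lower pools gives a symmetric result, yielding

$$E(p_U - p_L) = \frac{2\,\phi[\Phi^{-1}(1 - f)]}{f\sigma}\,E(p_G\mu_G),$$

where $X'/\sigma$ has been replaced by $\Phi^{-1}(1 - f)$.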

The expectation of the correlation between $p$ and $\mu$ for an individual is

$$E(p\mu) = p^2[a - (p - q)a - 2pqd] + 2pq \cdot \tfrac{1}{2} \cdot [d - (p - q)a - 2pqd] = pq[a - (p - q)d].$$

Similarly, the correlation between sibs $i$ and $j$ is $E(p_i\mu_j) = r_{ij}\,\sigma_p\sigma_A$, where $r_{ij}$ is their
genotypic correlation. Summing over sibs yields either $R\,\sigma_p\sigma_A$ (between-family pools) or
$(1 - R)\,\sigma_p\sigma_A$ (within-family pools) for $E(p_G\mu_G)$, with $R = (1/s)[1 + (s - 1)r]$ as before.
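The closed form for $E(p\mu)$ can be verified by enumerating the three genotypes. This is a minimal sketch assuming Hardy-Weinberg proportions, genotypic effects $a$, $d$, $-a$ centered on the mean effect $m = (p - q)a + 2pqd$, and per-individual allele frequencies 1, 1/2, and 0; the function names are hypothetical.

```python
import math


def allele_effect_expectation(p, a, d):
    """E(p * mu) by direct enumeration over A1A1, A1A2, A2A2."""
    q = 1.0 - p
    m = (p - q) * a + 2.0 * p * q * d
    # (genotype probability, A1 frequency of the individual, genotypic effect)
    genotypes = ((p * p, 1.0, a), (2.0 * p * q, 0.5, d), (q * q, 0.0, -a))
    return sum(prob * freq * (effect - m) for prob, freq, effect in genotypes)


def allele_effect_closed_form(p, a, d):
    """Closed form pq[a - (p - q)d]."""
    q = 1.0 - p
    return p * q * (a - (p - q) * d)


def sigma_p_times_sigma_a(p, a, d):
    """Product sigma_p * sigma_A, valid when a - (p - q)d is non-negative."""
    q = 1.0 - p
    sigma_p = math.sqrt(p * q / 2.0)
    sigma_a = math.sqrt(2.0 * p * q) * (a - d * (p - q))
    return sigma_p * sigma_a


if __name__ == "__main__":
    print(allele_effect_expectation(0.3, 1.0, 0.5),
          allele_effect_closed_form(0.3, 1.0, 0.5),
          sigma_p_times_sigma_a(0.3, 1.0, 0.5))
```

All three values coincide, consistent with $E(p\mu) = \sigma_p\sigma_A$ for a single individual.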

Selecting discordant-like sib-pairs is equivalent to selection based on $|\delta X_{k1}|$, and the
within-family analytical results are directly applicable. For larger families, discordant-like
families are pre-selected in decreasing rank order of the within-family phenotypic variance
$\sum_s (\delta X_{ks})^2$ summed over siblings $s$.

We have ascertained that the analytical results for the NCP are virtually
indistinguishable from exact numerical results when the QTL effect is 5% or less of the trait
variance. For larger effects, roughly when the effect size approaches the minor allele
frequency, the genotype-dependent phenotype distributions become resolved, transforming a
complex trait into a Mendelian trait amenable to traditional linkage analysis.

Example 3: Analytical fit for the optimal pooling fraction 

Optimizing the pooling fraction is equivalent to maximizing the objective function
$I = 2y^2/(f + f^2\kappa^2)$, where y is shorthand for $\phi[\Phi^{-1}(1 - f)]$. Writing f as $1 - \Phi(z)$ and
optimizing using $dI/dz = 0$ yields

$$y \cdot (1 + 2f\kappa^2) - 2zf \cdot (1 + f\kappa^2) = 0.$$

We have used $y = \phi(z)$, $dy/dz = -yz$, and $df/dz = -y$.
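
The stationarity condition has no closed-form solution, but it is straightforward to solve numerically. The Python sketch below applies bisection to y(1 + 2f kappa^2) - 2zf(1 + f kappa^2) = 0; the function names, the bracketing interval, and the tolerance are illustrative assumptions, and the bracket presumes a single sign change on (0, 10), which holds for moderate values of the scaled error.

```python
import math

def phi(z):
    """Standard normal probability density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def upper_tail(z):
    """f = 1 - Phi(z), the pooling fraction for threshold z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def stationarity(z, kappa):
    """Left-hand side of y(1 + 2 f k^2) - 2 z f (1 + f k^2) = 0."""
    y, f, k2 = phi(z), upper_tail(z), kappa * kappa
    return y * (1.0 + 2.0 * f * k2) - 2.0 * z * f * (1.0 + f * k2)

def optimal_z(kappa, lo=1e-6, hi=10.0, tol=1e-10):
    """Bisection: the expression is positive near z = 0 and negative at large z."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if stationarity(mid, kappa) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    for kappa in (0.0, 1.0, 2.0):
        z = optimal_z(kappa)
        print(kappa, z, upper_tail(z))   # scaled error, threshold, pooling fraction
```

With no measurement error the root corresponds to pooling roughly the top and bottom 27% of the population, and the optimal fraction shrinks as the scaled error grows.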

When $\kappa^2$ is large, z is also large, and f may be replaced by its asymptotic expansion for
large z, $f = y \cdot (z^{-1} - z^{-3})$. With this substitution, the optimum satisfies

$$\frac{2y\kappa^2}{z^3} = 1.$$

Taking the natural logarithm of both sides and equating exponents,

$$\frac{z^2}{2} + 3\ln z - \ln\!\left(\kappa^2\sqrt{2/\pi}\right) \equiv J(z) = 0.$$

When $\kappa$ and z are both large, the term $3\ln z$ is asymptotically small, giving

$$z \sim \sqrt{\ln(2\kappa^4/\pi)} \equiv B(\kappa).$$

An improved fit is obtained by perturbation theory by writing

$$z = B(\kappa)\,[1 + b(\kappa)],$$

where $\lim_{\kappa \to \infty} b(\kappa) = 0$. Substituting this expression for z into J(z) and simplifying,

$$B^2 b + 3\ln[B(1 + b)] = 0,$$

which gives the asymptotic form $b = -(3/B^2)\ln B$, or

$$z \approx B - (3/B)\ln B.$$

For clarity, the functional dependence of B and b on $\kappa$ has been suppressed.
The asymptotic form provides a good fit when $\kappa$ is much larger than 1 but not for
smaller values. Since the asymptotic behavior for large $\kappa$ is not affected by introducing terms
of lower order in $\kappa$, the fit can be improved for small $\kappa$ without degrading the fit at large $\kappa$ by
writing

$$z = A - (3/A)\ln A + a_1,$$

where

$$A(\kappa) = \sqrt{a_2 + \ln\!\left(1 + a_3\kappa^2 + 2\kappa^4/\pi\right)}.$$

The constants $a_1$, $a_2$, and $a_3$ are then selected to fit the exact numerical results at
particular values of $\kappa$. Fitting the results z = 0.612 at $\kappa$ = 0 and z = 0.8047 at $\kappa$ = 1 provides
the particular parameters

$$a_1 = -0.067, \quad a_2 = 2, \quad a_3 = 3.$$
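
The fitted form is simple to evaluate directly. The Python sketch below codes z = A - (3/A) ln A + a_1 with the parameters quoted above. It assumes that A(kappa) reads sqrt(a_2 + ln(1 + a_3 kappa^2 + 2 kappa^4 / pi)), a reading chosen because it reduces to the large-kappa form B(kappa) and reproduces the two quoted fitting points to within rounding; the function names are hypothetical.

```python
import math

A1, A2, A3 = -0.067, 2.0, 3.0   # fitted constants quoted in the text

def fitted_z(kappa):
    """Approximate optimal threshold z for scaled error kappa (assumed form of A)."""
    k2 = kappa * kappa
    big_a = math.sqrt(A2 + math.log(1.0 + A3 * k2 + 2.0 * k2 * k2 / math.pi))
    return big_a - (3.0 / big_a) * math.log(big_a) + A1

def pooling_fraction(z):
    """f = 1 - Phi(z): fraction of the population placed in each pool."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

if __name__ == "__main__":
    for kappa in (0.0, 1.0, 3.0, 10.0):
        z = fitted_z(kappa)
        print(kappa, z, pooling_fraction(z))
```

A closed form of this kind allows the pooling fraction to be chosen from the scaled error without a root-finding step.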



Table 1. The non-centrality parameter for family-based pooled DNA designs$^a$

Design                                        F                          G

Unrelated individuals                         1                          1

Between-family                                $R^2/T$                    $sR$

Within-family                                 $(1 - R)^2/(1 - T)$        $1 - r$

Within-family, discordant pre-selection$^b$   $(1 - r)^2/[2(1 - t)]$     $1 - r$

$^a$ The non-centrality parameter (NCP) is $[E(p_U - p_L)]^2 / \mathrm{Var}(p_U - p_L)$. The
numerator is $F \cdot \left(4\sigma_A^2 \sigma_p^2\, \phi[\Phi^{-1}(1 - f)]^2 / \sigma_R^2 f^2\right)$, where F is provided for each design, f is the
pooling fraction, $\sigma_A^2$ and $\sigma_R^2$ are the additive and residual variance for a QTL with allele
frequency p, $\sigma_p^2$ is $p(1 - p)/2$, $\phi(z)$ is the normal probability density and $\Phi(z)$ is the
cumulative normal probability. The denominator of the NCP is $[2(G + \tau^2)\sigma_p^2 / Nf] + 2\varepsilon^2$,
where G is provided for each design, $\tau$ is the coefficient of variation for DNA sample
concentrations in the pool, N is the total number of individuals before selection, and $\varepsilon$ is the
raw measurement error. The combined expression for the NCP is

$$(N\sigma_A^2/\sigma_R^2) \cdot [F/(G + \tau^2)] \cdot \left\{2\phi[\Phi^{-1}(1 - f)]^2 / (f + f^2\kappa^2)\right\},$$

where $\kappa^2$ is $N\varepsilon^2 / [(G + \tau^2)\sigma_p^2]$ and $\kappa$ is termed the scaled error. Each sibship has s sibs with
genotypic correlation r and phenotypic correlation t; R and T are $(1/s)[1 + (s - 1)r]$ and
$(1/s)[1 + (s - 1)t]$, respectively.

$^b$ Analytical results are for sib-pairs only. For larger families see numerical results
(Fig. 3).
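
The combined expression in footnote a can be evaluated with a few lines of Python. The sketch below is illustrative only: the function names and argument names are hypothetical, the design factors are F = R^2/T, G = sR for between-family pools and F = (1 - R)^2/(1 - T), G = 1 - r for within-family pools as read from Table 1, and the default values s = 2, r = 0.5, t = 0.25 are arbitrary assumptions rather than values taken from the text.

```python
from statistics import NormalDist

def design_factors(design, s=2, r=0.5, t=0.25):
    """Return (F, G) for a pooling design, as read from Table 1."""
    big_r = (1.0 + (s - 1) * r) / s
    big_t = (1.0 + (s - 1) * t) / s
    if design == "unrelated":
        return 1.0, 1.0
    if design == "between":
        return big_r ** 2 / big_t, s * big_r
    if design == "within":
        return (1.0 - big_r) ** 2 / (1.0 - big_t), 1.0 - r
    raise ValueError(f"unknown design: {design}")

def ncp(design, n, f, var_ratio, p, eps, tau=0.0, s=2, r=0.5, t=0.25):
    """Combined NCP; var_ratio is sigma_A^2 / sigma_R^2, eps the raw measurement error."""
    big_f, big_g = design_factors(design, s=s, r=r, t=t)
    sigma_p2 = p * (1.0 - p) / 2.0
    kappa2 = n * eps ** 2 / ((big_g + tau ** 2) * sigma_p2)   # squared scaled error
    normal = NormalDist()
    y = normal.pdf(normal.inv_cdf(1.0 - f))
    return n * var_ratio * (big_f / (big_g + tau ** 2)) * 2.0 * y * y / (f + f * f * kappa2)

if __name__ == "__main__":
    # 1000 unrelated individuals, 27% pools, QTL variance ratio 0.01, no measurement error
    print(ncp("unrelated", n=1000, f=0.27, var_ratio=0.01, p=0.5, eps=0.0))
    # the same design with a raw measurement error of 0.01
    print(ncp("unrelated", n=1000, f=0.27, var_ratio=0.01, p=0.5, eps=0.01))
```

Keeping the design factors in a separate function makes it easy to compare designs at a fixed sample size and measurement error.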
1. A method for detecting an association in a population of unrelated individuals
between a genetic locus and a quantitative phenotype, wherein two or more alleles occur at 
the locus, and wherein the phenotype is expressed using a numerical phenotypic value whose 
range falls within a first numerical limit and a second numerical limit, the method comprising 
the steps of 

a) obtaining the phenotypic value for each individual in the population; 

b) determining the minimum number of individuals from the population required for 
detecting the association using a non-centrality parameter; 

c) selecting a first subpopulation of individuals having phenotypic values that are 
higher than a predetermined lower limit and pooling DNA from the individuals in the first 
subpopulation to provide an upper pool; 

d) selecting a second subpopulation of individuals having phenotypic values that are 
lower than a predetermined upper limit and pooling DNA from the individuals in the second 
subpopulation to provide a lower pool; 

e) for one or more genetic loci, measuring the frequency of occurrence of each allele 
at said locus in the upper pool and the lower pool; 

f) for a particular genetic locus, measuring the difference in frequency of occurrence 
of a specified allele between the upper pool and the lower pool; and 

g) determining that an association exists if the allele frequency difference between the 
pools is larger than a predetermined value. 

2. The method of claim 1, wherein the difference in frequency of occurrence of the 
specified allele has associated with it an error of measurement. 

3. The method of claim 2, wherein the error of measurement is 0.04. 

4. The method of claim 2, wherein the error of measurement is 0.01. 

5. The method described in claim 1, wherein the predetermined lower limit is set so
that the upper pool ranges from including the highest 37% of the population to including the
highest 19% of the population and the predetermined upper limit is set so that the lower pool
ranges from including the lowest 37% of the population to including the lowest 19% of the
population.



6. The method of claim 1, wherein the predetermined lower limit is set so that the 
upper pool includes the highest 27% of the population and the predetermined upper limit is 
set so that the lower pool includes the lowest 27% of the population. 

7. The method of claim 1, wherein the genetic locus has two alleles. 

8. The method of claim 1 wherein the population includes individuals who may be 
classified into classes. 

9. The method of claim 8, wherein the classes are based on an age group, gender, 
race or ethnic origin. 

10. The method of claim 8, wherein all the members of a class are included in the
pools.

11. The method of claim 1 for determining the genetic basis of disease predisposition. 

12. The method of claim 11, wherein the genetic locus which is analyzed for 
determining the genetic basis of disease predisposition contains a single nucleotide 
polymorphism. 
Figure 1

Figure 3